DECLARATION UNDER 37 CFR §1.131
I, David S. Taubman, hereby declare as follows. 



1. I am the sole inventor of the subject matter recited in the pending claims of the
above-identified patent application. 

2. Prior to December 17, 1999, I completed my invention as described and claimed
in the subject application in this country, as evidenced by the following. 

a. Prior to December 17, 1999, I conceived the idea of an image data
compression system that included a decomposition processor coupled to an 
arithmetic coder. In operation, the decomposition processor would decompose 
the image data into code-blocks of coefficients using a transform, where each 
code-block included a plurality of bit-planes from a most significant bit-plane to a 
least significant bit-plane. The arithmetic coder would form an encoded
bit-stream by coding bit-planes of coefficient data in the code-blocks according to an
arithmetic coding scheme. The arithmetic coder also would be constructed such

CERTIFICATE OF MAILING 

I hereby certify that this correspondence is being deposited with the United 
States Postal Service as First Class Mail in an envelope addressed to: 
Commissioner for Patents, PO Box 1450, Alexandria, VA 22313-1450 on:

March 14, 2006 

(Signature of person mailing papers)
Edouard Garcia 



(Typed or printed name of person mailing papers) 



Applicant : David S. Taubman Attorney's Docket No.: 10991918-2 

Serial No. : 10/631,884 
Filed : July 29, 2003 
Page : 2 of 4 

that coefficient data from at least one bit-plane is not subjected to said arithmetic 
coding scheme so as to be included in the encoded bit-stream without arithmetic 
coding. 

b. Prior to December 17, 1999, I made a physical embodiment of the image
data compression system of ¶ 2.a in the form of a computer program, and I
operated this physical embodiment to carry out the steps of the associated method
in a manner demonstrative of the workability of the idea of ¶ 2.a.

c. The reduction to practice of the physical embodiment of ¶ 2.b is evidenced
by the invention disclosure document entitled "Various improvements to the 
block entropy coder in the EBCOT image compression algorithm" that is attached 
hereto as Exhibit A. 

d. Page 3 of Exhibit A describes an image data compression approach in
accordance with which "all of the binary symbols generated in the
'significance propagation' and 'magnitude refinement' coding passes (see
Section II-1.2 in the Attachment) representing bits in bit-planes $p < p_0 - K$
are written directly into the bit-stream as raw binary digits, entirely
bypassing the arithmetic coder, where $p_0$ denotes the most significant
bit-plane in which any sample in the relevant code-block becomes significant
and $K$ is a parameter which we suggest should be set to $K = 3$."

e. Page 3 of Exhibit A also describes the following modification that was 
made to the MQ coder in the physical embodiment of ¶ 2.b:

It should be pointed out that an additional modification to the Elias 
termination procedure was required in order to ensure that this 
"lazy" mode could be used in conjunction with the MQ coder. 
Specifically, since the MQ coder is byte-oriented, with a bit- 
stuffing rather than carry propagation policy for dealing with carry 
generation at the encoder, the arbitrary bit-stream suffices which 
can be generated by the emission of raw uncoded bits can generate 
illegal bit-stream for a previous MQ-coded pass. To avoid this
difficulty, we modified the Elias termination implementation to
allow for truly arbitrary suffices; the details of this modification 
are not warranted by the scope of this document, but they are 
adequately described by comments in the source code. 



f. The results of tests on the operation of the physical embodiment of ¶ 2.b
are presented in Tables 1 and 2 on pages 5 and 6 of Exhibit A, respectively, and
are summarized in the bulleted paragraphs at the end of page 6 of Exhibit A.
These results demonstrated that, relative to the other compression approaches
tested, the physical embodiment of ¶ 2.b substantially reduced the number of
symbols which must be arithmetically coded at high bit-rates while maintaining
comparable decompressed image quality for any given bit-rate.

3. I declare that all statements made herein of my own knowledge are true and that 
all statements made on declaration and belief are believed to be true; and further that these 
statements were made with the knowledge that willful false statements and the like so made are 
punishable by fine or imprisonment, or both, under Section 1001 of Title 18 of the United States 
Code, and that such willful false statements may jeopardize the validity of the application or any 
patent issuing thereon. 



Respectfully submitted, 




David S. Taubman 
Instructions: The information contained in this document is COMPANY CONFIDENTIAL and may not be disclosed to others without prior
authorization. Submit this disclosure to the HP Legal Department as soon as possible. No patent protection is possible until a patent application is
authorized, prepared, and submitted to the Government.

Descriptive Title of Invention:

"Various improvements to the block entropy coder in the EBCOT image compression algorithm" 



Name of Project: 
Product Name or Number: 



Was a description of the invention published, or are you planning to publish? If so, the date(s) and publication(s): 
Planning to distribute to JPEG2000 committee by 30 June 1999. 

Was a product including the invention announced, offered for sale, sold, or is such activity proposed? If so, the date(s) and location(s):
No 

Was the invention disclosed to anyone outside of HP, or will such disclosure occur? If so, the date(s) and name(s): 

Yes, various selected members of the JPEG2000 committee (University of Arizona, SAIC, Canon Research France, CISRA Australia, Sharp Labs
USA). This limited disclosure was made on March 18, 1999 in Seoul, Korea.



If any of the above situations will occur within 3 months, call your IP attorney or the Legal Department now at 1-553-3061 or 408-553-3061. 

Was the invention described in a lab book or other record? If so, please identify (lab book #, etc.) 



Was the invention built or tested? If so, the date: 
Yes, on June 1, 1999.

Was this invention made under a government contract? If so, the agency and contract number:

No 

Description of Invention: Please preserve all records of the invention and attach additional pages for the following. Each additional page should 
be signed and dated by the inventor(s) and witness(es). 

A. Prior solutions and their disadvantages (if available, attach copies of product literature, technical articles, patents, etc.). 

B. Problems solved by the invention. 

C. Advantages of the invention over what has been done before. 

D. Description of the construction and operation of the invention (include appropriate schematic, block, & timing diagrams; drawings; samples;
graphs; flowcharts; computer listings; test results; etc.)
Various improvements to the block entropy coder in the EBCOT image
compression algorithm.



David Taubman 
Patent disclosure 

This invention addresses some modifications to the EBCOT image compression method (see invention 
disclosure "Image compression system based on EBCOT"). The modifications are described in the context 
of the system, which uses sub-blocks in the EBCOT coding engine. The detailed description of the EBCOT 
system with sub-blocks is given in the document "Reduced complexity entropy coding with sub-blocks", 
shown as an attachment to this disclosure. This document "Reduced complexity entropy coding with sub- 
blocks" is a draft of the submission to the ISO SC29WG01 committee working on the new image 
compression standard JPEG 2000. All the references to various sections in this Invention Disclosure refer 
to the sections on the attached JPEG 2000 document. 

This invention proposes a method, which both reduces the average number of arithmetically coded symbols 
and also reduces the maximum number of coding passes in which symbols might need to be arithmetically 
coded, which can be of advantage in simplifying hardware implementations. The idea is very simple and is 
connected with techniques used in Ricoh's "CREW" algorithm. Specifically, we begin by observing that 
the probability models for many coding contexts attain distributions close to uniform in the least significant 
bit-planes. It is a waste of effort to use the arithmetic coding engine to code these binary symbols; instead, 
we prefer to send them as raw binary digits. Although it is difficult to interleave raw binary digits into an 
arithmetically coded bit-stream, it is possible to bypass the arithmetic coding engine altogether for an entire 
sub-bitplane coding pass provided the arithmetic coder is terminated at the end of the previous coding pass 
(using the Elias termination which allows for unique decoding with arbitrary bit-stream suffices) and 
restarted at the beginning of the next pass which requires arithmetic coding. This is a subset of the
behaviour offered by the "-Crestart" option used in our experiments (see Attachment), so the software
implementation supports the behaviour both with and without the costly reinitialization of coding context
states which goes with the "-Crestart" mode. For our experiments, however, we prefer to use this "lazy"
mode only in conjunction with "-Crestart" and sub-block causal contexts for reasons which will be 
explained shortly. To be specific, in the proposed, optional modification, all of the binary symbols 
generated in the "significance propagation" and "magnitude refinement" coding passes (see Section II-1.2
in the Attachment) representing bits in bit-planes $p < p_0 - K$ are written directly into the bit-stream as
raw binary digits, entirely bypassing the arithmetic coder, where $p_0$ denotes the most significant bit-plane
in which any sample in the relevant code-block becomes significant and $K$ is a parameter which we
suggest should be set to $K = 3$. It should be pointed out that an additional modification to the Elias
termination procedure was required in order to ensure that this "lazy" mode could be used in conjunction 
with the MQ coder. Specifically, since the MQ coder is byte-oriented, with a bit-stuffing rather than carry 
propagation policy for dealing with carry generation at the encoder, the arbitrary bit-stream suffices which 
can be generated by the emission of raw uncoded bits can generate illegal bit-stream for a previous MQ- 
coded pass. To avoid this difficulty, we modified the Elias termination implementation to allow for truly 
arbitrary suffices; the details of this modification are not warranted by the scope of this document, but they 
are adequately described by comments in the source code. 
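
The bypass rule described above can be sketched in a few lines. This is only an illustration under stated assumptions: the names `Pass` and `bypasses_arithmetic_coder` are hypothetical and do not come from the VM4 source code, and bit-planes are assumed to be numbered so that larger values of p are more significant.

```python
from enum import Enum


class Pass(Enum):
    """The three coding passes per bit-plane (hypothetical names)."""
    SIGNIFICANCE_PROPAGATION = 1
    MAGNITUDE_REFINEMENT = 2
    NORMALIZATION = 3


def bypasses_arithmetic_coder(pass_type, p, p0, K=3):
    """Return True if symbols of this pass would be written as raw bits.

    p  : index of the bit-plane being coded (larger = more significant)
    p0 : most significant bit-plane in which any sample of the code-block
         becomes significant
    K  : number of bit-planes below p0 that still use arithmetic coding
         (the disclosure suggests K = 3)
    """
    lazy_passes = (Pass.SIGNIFICANCE_PROPAGATION, Pass.MAGNITUDE_REFINEMENT)
    return pass_type in lazy_passes and p < p0 - K


if __name__ == "__main__":
    p0 = 10
    for p in range(p0, -1, -1):
        raw = [t.name for t in Pass if bypasses_arithmetic_coder(t, p, p0)]
        print(p, raw)
```

In this sketch the normalization pass is never bypassed, which matches the statement that only the significance propagation and magnitude refinement passes are affected.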

One advantage of the modification is that it substantially reduces the number of symbols which must be 
arithmetically coded at high bit-rates. Also, since we usually encode all code-blocks in an image at a high 
rate before truncating down to a final target bit-rate, this scheme substantially reduces the number of 
symbols which must typically be encoded and hence reduces the encoding time. We find, for example, that 
CPU times for reversible compression are typically reduced by 30%. On the other hand, the modification 
has relatively little effect on compression performance. 

The second advantage of the modification is that it substantially reduces the maximum number of coding
passes in which arithmetic coding might need to be used. Without the modification, the maximum number
of coding passes for any given code-block is $3P_{\max} - 2$, where $P_{\max}$ is the maximum number of bit-planes
in any given subband and might be on the order of 12 for the lower frequency subbands. On the other
hand, with the modification, the maximum number of coding passes requiring arithmetic coding for any
given code-block is $P_{\max} + 2K = P_{\max} + 6$. In applications where we intend to use microscopic parallelism to achieve
"sample-per-clock" throughput, this means a substantial reduction in the number of parallel arithmetic 
coding engines which must be included on the chip. Due to the importance of the combination of this 
mode with parallel encoding/decoding, we provide results in Table 1 for the combination of this so-called 
"lazy" option with the "-Crestart" option and sub-block causal coding contexts, comparing the performance 
with the use of sub-block causal coding contexts and "-Crestart" alone. 
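
As a worked instance of the two pass counts above, the following sketch evaluates both expressions. The function name `max_arithmetic_passes` is hypothetical; the value 12 for the number of bit-planes is the order-of-magnitude figure quoted in the text, not a measured result.

```python
def max_arithmetic_passes(p_max, K=None):
    """Maximum number of coding passes that may need arithmetic coding.

    p_max : maximum number of bit-planes in the subband
    K     : lazy-mode parameter; None means the lazy mode is switched off
    """
    if K is None:
        # Three passes per bit-plane, except that the most significant
        # bit-plane contributes only one pass: 3 * p_max - 2.
        return 3 * p_max - 2
    # With the lazy mode: p_max + 2K.
    return p_max + 2 * K


if __name__ == "__main__":
    print(max_arithmetic_passes(12))        # without the modification
    print(max_arithmetic_passes(12, K=3))   # with the modification, K = 3
```

For 12 bit-planes the count drops from 34 to 18, which is the reduction in parallel arithmetic coding engines that the paragraph refers to.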



Table 1 Comparison of the "lazy coding" option in combination
with the parallel options against the parallel options alone, for
a sub-block size of 16x16 and a code-block size of 64x64.
To conclude this section, it is worthwhile comparing the lossless performance associated with the various
different modifications and options which have been advanced. Table 2 provides this comparison in terms
of lossless bit-rate and total number of arithmetically coded symbols per sample, for five different
algorithms.



Table 2 Comparison of lossless coding performance for five
different algorithms (mode variations) using 64x64 code-blocks
and 16x16 sub-blocks where applicable with the 5/3
default reversible Wavelet kernel. The first pair of columns
refer to VM4 (EBCOT); the second pair refer to the coder
obtained by applying the modifications described in Sections
II-1 and II-2; the third pair of columns is obtained by adding
the parallel options ("-Crestart" and sub-block causal
context formation); the fourth pair of columns are obtained
by adding the "lazy" coding option; and the last pair of
columns are obtained using the "option" coder from VM4
(EBCOT) with "-Ccausal".
Some of the interesting points to observe are as follows:

♦ The "lazy" coding mode requires far fewer (usually less than half as many) symbols to be coded than 
any of the other modes. 

♦ The "lazy" coding mode generates a lower compressed bit-rate than that obtained with the same 
parallel options but the lazy mode turned off. Since the lazy mode affects only the significance 
propagation and magnitude refinement coding passes all of whose coding contexts are initialized to the 
MQ coder's standard initial state at the beginning of the "learning curve", this result indicates that the 
relevant distributions must be so close to uniform that emitting raw binary digits is more efficient than 
letting the arithmetic coder learn this uniform distribution. 

♦ The modifications to the original EBCOT algorithm described in Sections II-1 and II-2 have
negligible effect on lossless performance. 
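
The observation about near-uniform distributions can be made concrete with the binary entropy function, which gives the ideal cost in bits of coding one binary symbol. The sketch below is illustrative only; the probabilities used are arbitrary examples and are not taken from the experiments.

```python
import math


def binary_entropy(p):
    """Ideal coding cost, in bits, of a binary symbol with P(1) = p."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


if __name__ == "__main__":
    # A raw binary digit always costs exactly 1 bit. Even a perfectly
    # adapted arithmetic coder cannot save much when p is close to 0.5.
    for p in (0.5, 0.45, 0.4, 0.2):
        saving = 1.0 - binary_entropy(p)
        print(f"p = {p:.2f}: entropy = {binary_entropy(p):.4f} bits, "
              f"best-case saving = {saving:.4f} bits/symbol")
```

When the saving is this small, the cost of the coder learning the distribution from its initial state can exceed it, which is consistent with the lower bit-rate reported for the lazy mode.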
Reduced Complexity Entropy Coding
with Sub-Blocks



Partners: UNSW, CISRA, HP, U.Arizona/SAIC, CRF, Sharp Laboratories of America 



In this core experiment we explore the complexity-performance space associated with the EBCOT block 
coding engine [WG1N1020R] in VM4. VM4 currently contains two embedded block entropy coding 
engines: the original entropy coder from the EBCOT algorithm which provides the framework for VM4; 
and an "option" coder [WG1N1201], which provides an alternative, closely related entropy coding engine
for the same framework. In some respects, the "option" coder involves some relatively minor 
modifications of the EBCOT entropy coder: elimination of one of the four EBCOT coding passes; and no 
transposition of code-blocks from the HL band. On the other hand, the "option" coder introduces some 
additional options which may have significant benefit to efficient hardware implementations: i.e. optional 
modes to enable parallel encoding and/or decoding and the possibility of implementations with reduced 
external memory consumption. On the other hand, the "option" coder also eliminated the use of sub-blocks 
within each code-block and the associated sub-block significance coder from the EBCOT algorithm. In 
fact the "option" coder was implemented by forcing the sub-block size in the EBCOT algorithm equal to 
the code-block size. During the Seoul meeting it became apparent to the authors that the sub-block 
framework need not be eliminated in order to realize the benefits available from the various modes offered 
by the "option" coder. In fact, with the aid of sub-blocks, we conjectured that it should be possible to 
maintain a lower implementation complexity and/or CPU execution time, while reducing some of the 
performance penalty associated with the "option" coder. This report and the associated source code 
represent the result of our investigations into these issues. Specifically, the goal here is to quantify the 
impact of a variety of incremental modifications and/or mode switches on the original EBCOT algorithm 
and to recommend a cohesive block-based entropy coding algorithm which possesses a superior range of 
complexity performance trade-offs to those which exist in VM4 today. In so doing, we put forth a 
recommendation which merges the best features of the two entropy coding algorithms in VM4. 
It should be stressed that there are few innovations in this work. In fact, the similarities between the 
various entropy coding algorithms that we are considering far outweigh their differences. This is also 
evident from the similarity between their implementations which means that we are proposing only 
relatively minor changes to the VM4 software. The purpose of this work is thus to assemble a body of 
empirical evidence to enable us to settle on the fine details of the entropy coding algorithm as we approach 
the point of no return in the JPEG2000 standardization process. Specifically, we are concerned primarily 
with minimizing the following three sources of complexity, while sacrificing as little as possible in 
compression performance. 

a) The total number of symbols which must be arithmetically encoded/decoded per original image pixel.
While this figure does not embody all aspects of complexity, it does represent a substantial portion of the
implementation cost, particularly for hardware solutions. In this arena we explore mechanisms which 
reduce the average number of symbols which must be coded as well as mechanisms for reducing the 
maximum number of symbols which must be coded for any given subband sample. 



b) Opportunities for parallelism. The independent block coding principle in EBCOT ensures that separate 
code-blocks can always be encoded/decoded in parallel, which we might understand as "macroscopic" 
parallelism. This type of parallelism is ideally suited to software implementations on multi-processor 
architectures and provides a simple, if expensive mechanism for increasing throughput in hardware 
implementations. Here we explore opportunities for "microscopic" parallelism along the lines of the 
"option" coder [WG1N1201], where the arithmetic coding contexts and codeword generation state 
variables are reset at the boundaries of each coding pass. This microscopic parallelism provides a 
mechanism for improving throughput without replicating the resources required to store multiple code- 
blocks in local memory. Microscopic parallelism of this form is probably most relevant for efficient 
hardware implementations. 

c) Elimination of redundant coding passes and/or coding contexts whose cost in terms of implementation 
complexity does not justify the small gains in compression performance which result. 

As one might expect, the above goals lead to a family of trade-offs between performance and complexity 
and it is important that we consider only the most compatible points in this complexity-performance trade- 
off so as to ensure that hardware and software implementations can support the full range of trade-offs 
(probably with different throughputs and/or power consumption) with comparative ease. 
It turns out that the sub-block coding paradigm in EBCOT is of some significant assistance in containing 
the number of symbols which must be coded and also practical CPU execution times, which is one of the 
reasons for its original introduction in the entropy coding engine in VM4. On the other hand, there have 
been some questions regarding whether or not the quad-tree approach to coding the sub-block significance 
information is appropriate for JPEG2000. For this reason, we also investigate alternative approaches to 
coding the sub-block significance information. 

In discussing the various modifications made to the EBCOT entropy coder in the following section, we 
provide summary comparative results in terms of the following five images: 

♦ Lenna (512x512) - this is included because it is the most common image used for comparing 
compression performance in journal papers and because it is a small image with which quick 
comparisons can be generated by others. It is not included in the averages. 

♦ Aerial2 (2048x2048) - one of the four most commonly used JPEG2000 test images. This image rarely 
shows substantial performance differences between different entropy coding algorithms, mainly 
because it is so noisy that schemes to exploit structure in the image are relatively meritless. We do not 
include this image in the averages because it invariably biases all entropy coding comparisons toward 
0 difference. 

♦ Bike (2048x2560) - one of the four most commonly used JPEG2000 test images. This is included in
the average. 

♦ Cafe (2048x2560) - one of the four most commonly used JPEG2000 test images. This is included in 
the average. 

♦ Woman (2048x2560) - one of the four most commonly used JPEG2000 test images. This is included 
in the average. 

The 6-layer scalable mode is used with Daubechies 9/7 non-reversible filter kernels. This mode should by 
now be familiar to most readers. It is obtained with the following compression options, "-prof sp -Clayers 
0.0625 0.125 0.25 0.5 1.0 -rate 2.0" along with whatever other arguments are required to enable specific 
options for the test. These incremental results which appear in this report do not necessarily fulfill the 
requirements for reporting core experiment results. A separate spreadsheet will be attached to fulfill that 
requirement, reporting more comprehensive test results only for the key modes whose recommendation 
comes out of this report. A discussion of the separate results to be reported on attached spreadsheets 
appears in Section IV. 

We conclude this section by pointing out that the framework in which our implementation has been made 
and all associated results reported is that of a pre-release of the VM4.1 software, which is more up-to-date 
than any of the beta releases, but not current with the actual VM4.1 release. There are two important
differences between the code we have worked with and VM4.1: 

1) We have not used the new syntax by Ricoh, which became mandatory with the final release of VM4.1.
This avoids the main difficulty with the new syntax in that it does not easily allow new quantities to be 
recorded in the global header of the bit-stream for ease of experimentation. 

2) The "option" coder was actually changed immediately before the release of VM4.1 to use a different 
run-length coding method to that in the original EBCOT coder. In particular, this so-called "speed up" 
mode exploits properties of the arithmetic coding procedure (which has been modified accordingly) to
encode and/or decode multiple symbols at a time under some circumstances. Unfortunately, this
modification became available after most of the results for this report had been generated. In fact, we 
summarize complexity in this report almost entirely in terms of the number of symbols which are 
coded, since the "option" coder used exactly the same primitives as the EBCOT coder. With the "speed
up" mode, the primitives are not identical and the symbol counting procedure has been broken 
(symbols are not counted at all within the "run" mode). It is not clear how symbols should be counted 
in the run mode and it is also not clear whether this mode should be added to the coder proposed in this 
document. For this reason, we choose to rely upon the earlier incarnation of the "option" coder for all 
the symbol count results presented here; however, to provide as much information as possible 
concerning the performance of the current version of the "option" coder, we provide comparative CPU 
execution times with the latest version in Section III. This appears to be the best way to deal with the 
fact that there is no reliable way to count symbols for the "fast" run mode and this mode has not yet 
been investigated within the setting of sub-block based coding. 



II A Sequence of Modifications to the EBCOT Coder

We begin with the EBCOT entropy coding engine from VM4 and express the entropy coding variations 
investigated in terms of a sequence of incremental modifications of this coder, along with the 
corresponding experimental results. In this way, we avoid the complexities of describing the entropy coder 
in detail. Only the differences from the coder described in [WG1N1020R] are actually identified here. 



II-1 Preliminary Simplifications and Optimizations

We begin by eliminating some of the less useful features of the EBCOT entropy coder which contribute 
somewhat toward complexity. 

II-1.1 No "Far" neighbours

Firstly, we eliminate the so-called "Far Neighbourhood" context state which is used by the zero coding 
primitive when all 8 immediate neighbouring samples within the code-block are insignificant, but one or
more of the six neighbours which reside 2 samples to the left and/or right and within 1 sample vertically is 
already significant. This eliminates one of the 10 zero coding model contexts and reduces complexity 
somewhat. It has no apparent effect on software running time although it does introduce some 
simplifications for hardware implementations since the coding context now depends on 8 rather than 14 
neighbours and there is one less context to be stored (contexts might need to be kept in very fast register 
storage, which is much more expensive than on-chip SRAM). The effect on PSNR performance is on the 
order of about 0.01 dB and not worth reporting here. 
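
The neighbour counts quoted above (8 rather than 14) follow directly from the geometry. The sketch below enumerates the offsets as (row, column) displacements from the current sample; the names `IMMEDIATE_NEIGHBOURS` and `FAR_NEIGHBOURS` are hypothetical and the layout is inferred from the description in this section, not copied from the VM4 source.

```python
# The eight immediate neighbours: every sample within one row and one
# column of the current sample, excluding the sample itself.
IMMEDIATE_NEIGHBOURS = [
    (dy, dx)
    for dy in (-1, 0, 1)
    for dx in (-1, 0, 1)
    if (dy, dx) != (0, 0)
]

# The six "far" neighbours: two samples to the left or right and within
# one sample vertically.
FAR_NEIGHBOURS = [(dy, dx) for dy in (-1, 0, 1) for dx in (-2, 2)]

if __name__ == "__main__":
    print(len(IMMEDIATE_NEIGHBOURS))                        # 8
    print(len(FAR_NEIGHBOURS))                              # 6
    print(len(IMMEDIATE_NEIGHBOURS) + len(FAR_NEIGHBOURS))  # 14
```

Dropping the far set therefore shrinks the context window from a 3x5 region to a 3x3 region around the sample.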

II-1.2 Reduction from 4 to 3 Coding Passes per Bit-Plane

The original EBCOT embedded block coder specifies 4 coding passes per bit-plane, designated as follows:

♦ "Forward significance propagation pass", $\mathcal{P}_1^p$. We preserve this as is, except that we collect all
subband samples for which any of the eight immediate neighbours have already been found to be
significant into this pass to compensate for the fact that we are dropping the reverse pass, below.

♦ "Backward significance propagation pass", $\mathcal{P}_2^p$. We eliminate this coding pass.

♦ "Magnitude refinement pass", $\mathcal{P}_3^p$. We preserve this as is.

♦ "Normalization pass", $\mathcal{P}_4^p$. We preserve this as is.

In all other respects, the description in [WG1N1020R] remains unchanged. The reason for dropping the 
reverse pass is that it contributes very little to overall compression performance with most images and adds 
to the implementation complexity and software running time (although not to the number of symbols which 
must be coded). It is also partly incompatible with the options discussed in Section II-3 to increase 
opportunities for parallelism, although this could have been circumvented by using a reverse scan only
within sub-blocks, if it had turned out to be sufficiently important. Table 3 shows the comparison between
the performance of VM4 and the modified coder obtained by removing the backward significance 
propagation pass and the "Far" neighbourhood context. 
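
The three passes that remain after the backward pass is dropped can be pictured as a partition of the samples in each bit-plane. The sketch below is a deliberate simplification and not the VM4 implementation: it assumes a frozen snapshot of the significance state (a real coder updates significance as the pass proceeds), it treats neighbours outside the code-block as insignificant, and it relies on the general EBCOT convention that already-significant samples are handled by the magnitude refinement pass. The function name `classify_sample` is hypothetical.

```python
def classify_sample(significant, y, x):
    """Return the name of the pass that would code sample (y, x).

    significant : 2-D list of booleans, True where a sample has already
                  been found significant in a previous bit-plane or pass
    """
    rows, cols = len(significant), len(significant[0])
    if significant[y][x]:
        return "magnitude refinement"
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if (dy, dx) == (0, 0):
                continue
            ny, nx = y + dy, x + dx
            inside = 0 <= ny < rows and 0 <= nx < cols
            if inside and significant[ny][nx]:
                # Any of the eight immediate neighbours is significant.
                return "significance propagation"
    return "normalization"


if __name__ == "__main__":
    state = [[False] * 4 for _ in range(4)]
    state[1][1] = True
    for y in range(4):
        print([classify_sample(state, y, x)[0] for x in range(4)])
```

The example prints the first letter of the pass name for each sample of a 4x4 block with a single significant sample.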



Table 3 Comparison of VM4 with coder obtained by introducing
modifications described in Sections II-1.1 and II-1.2

                VM4                        |              Modified coder
Rate (bpp)  Symbols/sample  PSNR (dB)      |  Rate (% change)   Symbols (% change)  PSNR (dB difference)

Lenna (512x512)
0.06174     0.093           28.19          |  0.06229 (+0.9)    0.091 (-2.2)        28.21 (+0.02)
0.12482     0.189           31.10          |  0.12433 (-0.4)    0.180 (-4.8)        31.06 (-0.04)
0.24982     0.382           34.20          |  0.24899 (-0.3)    0.367 (-4.0)        34.17 (-0.03)
0.49905     0.749           37.33          |  0.49933 (+0.1)    0.720 (-3.8)        37.32 (-0.01)
0.99942     1.430           40.43          |  0.99994 (+0.1)    1.378 (-3.6)        40.44 (+0.00)
1.99094     2.590           44.92          |  1.99579 (+0.2)    2.513 (-3.0)        44.92 (-0.00)

Aerial2 (2048x2048)
0.06248     0.094           24.66          |  0.06230 (-0.3)    0.090 (-4.1)        24.63 (-0.03)
0.12484     0.192           26.53          |  0.12441 (-0.3)    0.185 (-3.6)        26.51 (-0.02)
0.24995     0.364           28.58          |  0.24999 (+0.0)    0.350 (-3.8)        28.58 (-0.00)
0.49934     0.694           30.63          |  0.49982 (+0.1)    0.674 (-3.0)        30.63 (-0.00)
0.99969     1.349           33.27          |  0.99930 (-0.0)    1.303 (-3.4)        33.24 (-0.02)
1.99812     2.514           38.11          |  1.99850 (+0.0)    2.445 (-2.7)        38.10 (-0.01)

Bike (2048x2560)
0.06215     0.120           23.81          |  0.06249 (+0.5)    0.115 (-3.4)        23.80 (-0.02)
0.12483     0.229           26.41          |  0.12496 (+0.1)    0.220 (-3.9)        26.36 (-0.05)
0.24985     0.440           29.66          |  0.24994 (+0.0)    0.423 (-3.9)        29.62 (-0.04)
0.49941     0.836           33.54          |  0.49986 (+0.1)    0.802 (-4.1)        33.51 (-0.03)
0.99997     1.552           38.12          |  0.99972 (-0.0)    1.488 (-4.1)        38.08 (-0.03)
1.99589     2.825           44.03          |  1.99894 (+0.2)    2.723 (-3.6)        44.03 (-0.01)

Cafe (2048x2560)
0.06246     0.101           19.07          |  0.06240 (-0.1)    0.098 (-3.2)        19.05 (-0.02)
0.12493     0.212           20.81          |  0.12462 (-0.2)    0.204 (-3.6)        20.75 (-0.06)
0.24985     0.405           23.16          |  0.24971 (-0.1)    0.389 (-4.1)        23.13 (-0.03)
0.49957     0.765           26.83          |  0.49982 (+0.1)    0.738 (-3.5)        26.80 (-0.03)
0.99999     1.485           32.06          |  0.99995 (-0.0)    1.427 (-3.9)        32.03 (-0.03)
1.99883     2.759           39.12          |  1.99894 (+0.0)    2.661 (-3.6)        39.07 (-0.05)

Woman (2048x2560)
0.06222     0.094           25.63          |  0.06248 (+0.4)    0.091 (-2.9)        25.63 (+0.00)
0.12490     0.193           27.39          |  0.12471 (-0.2)    0.185 (-4.2)        27.36 (-0.03)
0.24983     0.373           30.02          |  0.24984 (+0.0)    0.359 (-4.0)        30.01 (-0.01)
0.49965     0.711           33.66          |  0.49896 (-0.1)    0.685 (-3.6)        33.63 (-0.03)
0.99998     1.367           38.45          |  0.99964 (-0.0)    1.324 (-3.1)        38.44 (-0.01)
1.99885     2.598           44.05          |  1.99980 (+0.0)    2.521 (-3.0)        44.04 (-0.01)

Average (Bike, Cafe, Woman)
0.06228     0.105           22.84          |  0.06246 (+0.3)    0.101 (-3.2)        22.82 (-0.01)
0.12489     0.211           24.87          |  0.12476 (-0.1)    0.203 (-3.9)        24.83 (-0.04)
0.24985     0.406           27.61          |  0.24983 (-0.0)    0.390 (-4.0)        27.59 (-0.03)
0.49954     0.771           31.34          |  0.49955 (+0.0)    0.742 (-3.7)        31.31 (-0.03)
0.99998     1.468           36.21          |  0.99977 (-0.0)    1.413 (-3.7)        36.18 (-0.02)
1.99786     2.727           42.40          |  1.99923 (+0.1)    2.635 (-3.4)        42.38 (-0.02)

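
The parenthesized figures in Table 3 appear to be percentage changes (for bit-rate and symbol count) and dB differences (for PSNR) of the modified coder relative to VM4. The sketch below shows that reading on the first Lenna row; the function names are hypothetical. Because the table shows rounded values, recomputing from the printed entries does not always reproduce the last printed digit.

```python
def relative_change_percent(reference, modified):
    """Percentage change of `modified` relative to `reference`."""
    return round(100.0 * (modified - reference) / reference, 1)


def psnr_difference(reference, modified):
    """PSNR difference in dB, modified minus reference."""
    return round(modified - reference, 2)


if __name__ == "__main__":
    # First Lenna row of Table 3: VM4 values followed by modified values.
    print(relative_change_percent(0.06174, 0.06229))  # bit-rate
    print(relative_change_percent(0.093, 0.091))      # symbols per sample
    print(psnr_difference(28.19, 28.21))              # PSNR
```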



II-1.3 No Block Transposition 

In the original EBCOT entropy coder, each code-block from the HL subband (horizontally high-pass,
vertically low-pass) is transposed so that both LH and HL code-blocks (the most significant)
contain predominantly vertical edge features. This somewhat simplified the implementation and testing in 
software and also reduced the complexity when the "Far Neighbourhood" context was in use. On the other 
hand, if we remove the "Far" neighbourhood context, as described above, this transposition step could do 
more harm than good to the overall complexity. More importantly, the transposition operation prevents us 
from deriving implementations with reduced external memory consumption as discussed in Section II-3. 
The penalty for eliminating transposition is that the run-length coding mode in EBCOT might not respond 
so well to the HL subband's code-blocks so that we might expect poorer performance. Table 4, however, 
shows that this impact is negligible. 



Table 4 Comparison of VM4 with coder obtained by introducing 
modifications described in Sections II-1.1, II-1.2 and II-1.3. 







0.06174   0.093   28.19   0.06152 (-0.3)   0.091 (-2.3)   28.18 (-0.01)
0.12482   0.189   31.10   0.12488 (+0.0)   0.183 (-3.1)   31.06 (-0.05)
0.24982   0.382   34.20   0.24915 (-0.3)   0.372 (-2.8)   34.16 (-0.04)
0.49905   0.749   37.33   0.49942 (+0.1)   0.729 (-2.7)   37.30 (-0.03)
0.99942   1.430   40.43   0.99954 (+0.0)   1.388 (-2.9)   40.42 (-0.02)
1.99094   2.590   44.92   1.99225 (+0.1)   2.521 (-2.7)   44.88 (-0.04)






Aerial 2 (2048x2048)
0.06248   0.094   24.66   0.06246 (-0.0)   0.090 (-3.6)   24.63 (-0.02)
0.12484   0.192   26.53   0.12425 (-0.5)   0.185 (-3.3)   26.50 (-0.03)
0.24995   0.364   28.58   0.25000 (+0.0)   0.351 (-3.4)   28.58 (-0.00)
0.49934   0.694   30.63   0.49978 (+0.1)   0.675 (-2.7)   30.63 (-0.00)
0.99969   1.349   33.27   0.99996 (+0.0)   1.306 (-3.2)   33.25 (-0.02)
1.99812   2.514   38.11   1.99762 (-0.0)   2.446 (-2.7)   38.10 (-0.01)






Bike (2048x2560)
0.06215   0.120   23.81
0.12483   0.229   26.41   0.12486 (+0.0)   0.224 (-2.4)   26.34 (-0.07)
0.24985   0.440   29.66   0.24978 (-0.0)   0.429 (-2.5)   29.61 (-0.05)
0.49941   0.836   33.54   0.49998 (+0.1)   0.812 (-2.9)   33.50 (-0.03)
0.99997   1.552   38.12   0.99989 (-0.0)   1.504 (-3.1)   38.08 (-0.03)
1.99589   2.825   44.03   1.99885 (+0.1)   2.739 (-3.0)   44.02 (-0.01)






Cafe (2048x2560)
0.06246   0.101   19.07   0.06230 (-0.3)   0.099 (-2.0)   19.04 (-0.02)
0.12493   0.212   20.81   0.12458 (-0.3)   0.207 (-2.2)   20.75 (-0.06)
0.24985   0.405   23.16   0.24995 (+0.0)   0.395 (-2.5)   23.13 (-0.03)
0.49957   0.765   26.83   0.50000 (+0.1)   0.748 (-2.2)   26.79 (-0.03)
0.99999   1.485   32.06   0.99928 (-0.1)   1.441 (-2.9)   32.02 (-0.05)
1.99883   2.759   39.12   1.99925 (+0.0)   2.680 (-2.9)   39.07 (-0.05)



Woman (2048x2560)
0.06222   0.094   25.63   0.06243 (+0.3)   0.092 (-2.7)   25.62 (-0.01)
0.12490   0.193   27.39   0.12462 (-0.2)   0.186 (-3.7)   27.35 (-0.03)
0.24983   0.373   30.02   0.24986 (+0.0)   0.360 (-3.7)   30.00 (-0.02)
0.49965   0.711   33.66   0.49977 (+0.0)   0.688 (-3.2)   33.63 (-0.02)
0.99998   1.367   38.45   1.00000 (+0.0)   1.329 (-2.8)   38.43 (-0.01)
1.99885   2.598   44.05   1.99844 (-0.0)   2.526 (-2.8)   44.03 (-0.02)




Average (Bike, Cafe, Woman)
0.06228   0.105   22.84   0.06239 (+0.2)   0.103 (-2.0)   22.82 (-0.02)
0.12489   0.211   24.87   0.12469 (-0.2)   0.206 (-2.7)   24.81 (-0.06)
0.24985   0.406   27.61   0.24987 (+0.0)   0.395 (-2.9)   27.58 (-0.04)
0.49954   0.771   31.34   0.49992 (+0.1)   0.750 (-2.7)   31.31 (-0.03)
0.99998   1.468   36.21   0.99972 (-0.0)   1.425 (-2.9)   36.18 (-0.03)
1.99786   2.727   42.40   1.99885 (+0.0)   2.649 (-2.9)   42.37 (-0.03)



II-1.4 Skewed Initialization for the Arithmetic Coder 

The original EBCOT coder uses the default initial state for all context models in the conditional arithmetic 
coding process. Some benefit can be gained by initializing the run-length and all-zero contexts with 
appropriately skewed distributions, as was done for the "option" coder in [WG1N1201]. Also, for symbols 
with an assumed uniform distribution, the MQ coder introduced from Seoul uses a separate context, which 
was inappropriately initialized in VM4. Nevertheless, preferential initialization of the arithmetic coder 
context states has the most substantial effect when the coder is restarted at the beginning of each coding 
pass as described in Section II-3. We take the tuned initial states for this case and apply them to the coder 
with the modifications described in the previous three sections so as to move toward the starting point in a 
family of very closely related coders which is developed in the following sections. The initializations 
themselves are shown below; the results shown in Table 5 indicate that the initializations which are tuned 



Run-length coding context (for runs of four samples, all currently 
insignificant, with all neighbours of each sample currently insignificant): 
    MQ coder state 6 (Probability of "1" = 0.042; on the "learning curve") 

All zero neighbourhood context (for samples which are currently 
insignificant, with all neighbours insignificant, which do not qualify as 
members of a run): 
    MQ coder state 8 (Probability of "1" = 0.020; on the "learning curve") 

Assumed uniform context (for symbols which were coded without any 
adaptive context prior to the introduction of the MQ coder): 
    MQ coder state 28 (Probability of "1" = 0.34; not on the "learning curve") 
    This context is re-initialized at the start of each coding pass. 

All other context models: 
    MQ coder state 0 (Probability of "1" = 0.34; at start of "learning curve") 

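
The initial states listed above can be collected into a small lookup table. The following is only a sketch: the context names (`run_length`, `all_zero_neighbourhood`, `assumed_uniform`) and the helper functions are hypothetical labels introduced for illustration, while the state numbers are the ones given in the list. It also shows the one context that is reset at the start of every coding pass.

```python
# Context names are hypothetical labels; the MQ state numbers are taken
# from the list of initializations above.
SKEWED_INITIAL_STATES = {
    "run_length": 6,
    "all_zero_neighbourhood": 8,
    "assumed_uniform": 28,
}
DEFAULT_INITIAL_STATE = 0  # every other context model

# Contexts that are re-initialized at the start of each coding pass.
PER_PASS_CONTEXTS = ("assumed_uniform",)


def initial_states(context_names):
    """Return the initial MQ state index for each named context."""
    return {
        name: SKEWED_INITIAL_STATES.get(name, DEFAULT_INITIAL_STATE)
        for name in context_names
    }


def start_coding_pass(states):
    """Return a copy of `states` with the per-pass contexts reset."""
    refreshed = dict(states)
    for name in PER_PASS_CONTEXTS:
        if name in refreshed:
            refreshed[name] = SKEWED_INITIAL_STATES[name]
    return refreshed
```

Keeping the skewed states in one table makes it easy to swap in a different tuning without touching the coding passes themselves.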


Table 5 Comparison of VM4 with coder obtained by introducing 
modifications described in Sections II-1.1, II-1.2, II-1.3 and 
II-1.4. 













0.06174   0.093   28.19   0.06171 (-0.1)   0.091 (-2.3)   28.18 (-0.01)
0.12482   0.189   31.10   0.12491 (+0.1)   0.183 (-3.1)   31.06 (-0.05)
0.24982   0.382   34.20   0.24973 (-0.0)   0.373 (-2.4)   34.17 (-0.03)
0.49905   0.749   37.33   0.49976 (+0.1)   0.730 (-2.6)   37.31 (-0.02)
0.99942   1.430   40.43   0.99973 (+0.0)   1.388 (-2.9)   40.41 (-0.03)
1.99094   2.590   44.92   1.99149 (+0.0)   2.521 (-2.7)   44.88 (-0.04)

Aerial 2 (2048x2048)
0.06248   0.094   24.66   0.06248 (+0.0)   0.091 (-3.5)   24.63 (-0.02)
0.12484   0.192   26.53   0.12490 (+0.1)   0.186 (-3.0)   26.51 (-0.02)
0.24995   0.364   28.58   0.24993 (-0.0)   0.352 (-3.4)   28.58 (-0.00)
0.49934   0.694   30.63   0.49986 (+0.1)   0.676 (-2.7)   30.63 (-0.00)
0.99969   1.349   33.27   0.99996 (+0.0)   1.307 (-3.1)   33.25 (-0.01)
1.99812   2.514   38.11   1.99737 (-0.0)   2.446 (-2.7)   38.10 (-0.01)


Bike (2048x2560)
0.06215   0.120   23.81   0.06242 (+0.4)   0.118 (-1.6)   23.79 (-0.03)
0.12483   0.229   26.41   0.12465 (-0.1)   0.224 (-2.5)   26.34 (-0.08)
0.24985   0.440   29.66   0.24975 (-0.0)   0.428 (-2.6)   29.61 (-0.05)
0.49941   0.836   33.54   0.49991 (+0.1)   0.812 (-2.9)   33.50 (-0.03)
0.99997   1.552   38.12   0.99992 (-0.0)   1.504 (-3.1)   38.08 (-0.03)
1.99589   2.825   44.03   1.99944 (+0.2)   2.741 (-3.0)   44.03 (-0.01)



Cafe (2048x2560)
0.06246   0.101   19.07   0.06230 (-0.3)   0.099 (-1.9)   19.05 (-0.02)
0.12493   0.212   20.81   0.12462 (-0.3)   0.207 (-2.2)   20.75 (-0.06)
0.24985   0.405   23.16   0.24948 (-0.1)   0.394 (-2.7)   23.12 (-0.04)
0.49957   0.765   26.83   0.49992 (+0.1)   0.748 (-2.2)   26.79 (-0.04)
0.99999   1.485   32.06   0.99906 (-0.1)   1.441 (-2.9)   32.02 (-0.05)
1.99883   2.759   39.12   1.99651 (-0.1)   2.678 (-2.9)   39.06 (-0.06)



Woman (2048x2560)
0.06222   0.094   25.63   0.06233 (+0.2)   0.091 (-2.9)   25.62 (-0.01)
0.12490   0.193   27.39   0.12483 (-0.1)   0.187 (-3.4)   27.36 (-0.03)
0.24983   0.373   30.02   0.24971 (-0.0)   0.360 (-3.7)   30.00 (-0.03)
0.49965   0.711   33.66   0.49993 (+0.1)   0.688 (-3.2)   33.63 (-0.03)
0.99998   1.367   38.45   0.99989 (-0.0)   1.329 (-2.8)   38.43 (-0.02)
1.99885   2.598   44.05   1.99481 (-0.2)   2.523 (-2.9)   44.02 (-0.03)

Average (Bike, Cafe, Woman)
0.06228   0.105   22.84   0.06235 (+0.1)   0.103 (-2.1)   22.82 (-0.02)
0.12489   0.211   24.87   0.12470 (-0.2)   0.206 (-2.7)   24.82 (-0.06)
0.24985   0.406   27.61   0.24965 (-0.1)   0.394 (-3.0)   27.57 (-0.04)
0.49954   0.771   31.34   0.49992 (+0.1)   0.750 (-2.7)   31.31 (-0.03)
0.99998   1.468   36.21   0.99962 (-0.0)   1.425 (-2.9)   36.18 (-0.03)
1.99786   2.727   42.40   1.99692 (-0.0)   2.647 (-2.9)   42.37 (-0.04)



II-2 Alternatives to Quad-Tree Coding of Sub-Block Significance 

In the original EBCOT algorithm the significance of each sub-block within a code-block is coded using a 
simple embedded quad-tree code. Here we develop an alternative method based on the techniques used for 
bit-plane coding. Specifically, at the beginning of each bit-plane, we use the arithmetic coding engine 
rather than a quad-tree code to signal the significance of each sub-block relative to that bit-plane. We 
simply scan through all the sub-blocks in the code-block in the usual lexicographical order, skipping over 
those sub-blocks which are already known to be significant and emitting a single binary symbol to identify 
the significance of each of the other blocks ("1" if it becomes significant, "0" otherwise). The symbol is 
coded in one of two different contexts as follows: 

♦ If any of the four immediate neighbours (above, below, to the left or to the right) of the sub-block 
which lie within the same code-block have already been found to be significant, we use the "assumed 
uniform" coding context, which is reinitialized to the usual value (MQ coder state 28) at the beginning 
of the scan; remember that this context is reinitialized at the start of every coding pass, so reusing it in 
place of the quad-tree code does not interfere with any of the other coding operations. 

♦ Otherwise, when all four immediate neighbours are still insignificant, we use a special coding context 
which is initialized to MQ coder state 4 (a slightly skewed state near the start of the "learning curve") 
at the commencement of the scan. 
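
The scan and the two-way context choice described above can be sketched as follows. This is only an illustration under stated assumptions, not the VM implementation: the function and context names are hypothetical, the arithmetic coder itself is left out (the function only reports which symbol would be coded in which context), and sub-blocks found significant earlier in the same scan are assumed to count as "already found to be significant" for later sub-blocks, which a decoder could reproduce.

```python
# Hypothetical labels for the two contexts described above.
CTX_ASSUMED_UNIFORM = "assumed_uniform"  # reset to MQ state 28 at start of scan
CTX_ISOLATED = "isolated"                # reset to MQ state 4 at start of scan


def scan_subblock_significance(already_sig, becomes_sig):
    """List the (row, col, context, symbol) tuples emitted for one bit-plane.

    already_sig[r][c] is True if sub-block (r, c) became significant in an
    earlier bit-plane; becomes_sig[r][c] is True if it becomes significant
    in the current one. Both grids cover a single code-block.
    """
    rows, cols = len(already_sig), len(already_sig[0])
    # Working copy, updated as the scan proceeds (assumption, see lead-in).
    sig = [row[:] for row in already_sig]
    emitted = []
    for r in range(rows):          # lexicographical (raster) order
        for c in range(cols):
            if sig[r][c]:
                continue           # already significant: nothing is coded
            neighbours = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
            any_sig = any(
                0 <= nr < rows and 0 <= nc < cols and sig[nr][nc]
                for nr, nc in neighbours
            )
            context = CTX_ASSUMED_UNIFORM if any_sig else CTX_ISOLATED
            symbol = 1 if becomes_sig[r][c] else 0
            emitted.append((r, c, context, symbol))
            if symbol:
                sig[r][c] = True
    return emitted
```

Neighbours outside the code-block are simply ignored, matching the restriction to neighbours "which lie within the same code-block".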

The bit-plane coding operations are interleaved into the bit-stream in exactly the same manner and sub- 
block significance has exactly the same interpretation as in the original EBCOT algorithm described in 
[WG1N1020R]. We find that this algorithm gives essentially the same coding performance as the quad- 
tree coder, with a negligible improvement of between 0.01dB and 0.02dB. Table 6 provides comparative 
performance figures for the quad-tree coder and the so-called scan-subs coder in the default case of 16x16 
sub-blocks and 64x64 code-blocks. Of more interest is the comparison shown in Table 7 for 8x8 sub- 
blocks and 64x64 code-blocks, since in this case the sub-block significance code comprises a greater 
proportion of the overall bit-rate. 



Table 6 Comparison of the embedded quad-tree with the scan- 
subs method for coding sub-block significance. In both 
cases, the modifications described in Sections II-1.1, II-1.2, 
II-1.3 and II-1.4 are included. Here the default sub-block 
and code-block sizes of 16x16 and 64x64 respectively are 
used. 

Table 7 Comparison of the embedded quad-tree with the scan- 
subs method for coding sub-block significance. In both 
cases, the modifications described in Sections II-1.1, II-1.2, 
II-1.3 and II-1.4 are included. Here the sub-block and code- 
block sizes are 8x8 and 64x64, respectively. 

II-3 Options to Enable Parallelism 

As with any block coder, it is always possible to compress or decompress any number of blocks in parallel 
and indeed this is most likely the easiest way to exploit parallelism. We refer to this as macroscopic 
parallelism, because it does not require tight synchronization between the respective coding engines and 
can be realized in multi-threaded software implementations with comparative ease. A second opportunity 
for parallelism arises if the coding passes for any given block can also be performed in parallel. This is 
enabled if separate probability model contexts are maintained for each coding pass within each code-block. 
Of course, there is some additional adaptation overhead in this case, but this overhead is minimized by the 
fact that we only visit those sub-blocks which are known to be significant already. Moreover, we will need 
to terminate the arithmetic code word for each coding pass in such a way as to ensure that the decoder can 
recover the termination point implicitly during the decoding process and so know where each subsequent 
coding pass's bit-stream begins. This involves an average overhead of approximately 1.5 bits. Finally, if 
we want to enable parallel decoding, as well as encoding, the length of the bit-stream segment 
corresponding to each coding pass must be explicitly identified, rather than implicitly determined by the 
decoding process itself and we must also restrict the non-causal context formation model used in the 
original EBCOT algorithm. These conditions and their solution are virtually identical to those associated 
with the "option" coder in VM4 [WG1N1201], with the following exceptions: 

♦ There is no need to depart from the sub-block paradigm, as done in the "option" coder, or to entirely 
abandon the non-causal context models. Instead, we obtain a solution with higher compression 
performance and a reduced number of symbols to be arithmetically encoded/decoded by modifying the 
coder described above so that the coding contexts are "sub-block causal". That is, no sample in one 
sub-block may be coded with respect to a context formed using samples from a later sub-block in the 
lexicographical scan pattern. This affects the coding contexts only for those samples which lie on the 
lower and right hand boundaries of each sub-block. In fact, the modification is so minor that our 
implementation does not need to invoke separate coding pass functions for this mode. 

♦ The arithmetic coder is also terminated and restarted at the beginning of each sub-block significance 
coding pass using the algorithm outlined in Section II-2. 
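
The "sub-block causal" restriction in the first item can be illustrated with a short sketch. The function name is hypothetical, and an 8-connected neighbourhood and a raster ordering of sub-blocks are assumed for illustration; the sketch simply drops any neighbour that belongs to a sub-block later in the scan than the sample's own sub-block.

```python
def causal_neighbours(row, col, height, width, sub):
    """Neighbours of sample (row, col) usable under the sub-block causal rule.

    The code-block is `height` x `width` samples, split into square
    sub-blocks of side `sub` that are scanned in raster order.
    """
    subs_per_row = -(-width // sub)  # ceiling division

    def scan_index(r, c):
        # Position of the sub-block containing (r, c) in the scan.
        return (r // sub) * subs_per_row + (c // sub)

    own = scan_index(row, col)
    allowed = []
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            if dr == 0 and dc == 0:
                continue
            nr, nc = row + dr, col + dc
            if not (0 <= nr < height and 0 <= nc < width):
                continue  # outside the code-block
            if scan_index(nr, nc) > own:
                continue  # belongs to a later sub-block in the scan
            allowed.append((nr, nc))
    return allowed
```

With this rule, interior samples keep their full neighbourhood; only samples on the lower and right hand boundaries of a sub-block lose neighbours, as stated above.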

We currently envisage these modifications (sub-block causal context formation, restarting the arithmetic 
coder and reinitializing the context states at each coding pass, and markers to assist with parallel decoding) 
as options for the bit-stream since they each involve some sacrifice in compression performance and the 
reward can only be realized in certain hardware implementations. Nevertheless, the modifications 
represent minor departures from the first member of the proposed family of entropy coders, i.e. that 
obtained by applying all the modifications discussed in Sections II-1 and II-2. For convenience, we will 
identify the option to reset the context states and restart the arithmetic coder at each coding pass boundary 
as "-Crestart". Before providing some experimental results, we discuss the implications of these optional 
modifications. 

II-3.1 Implications for Parallelism 

Here we briefly consider some of the options for parallel decoding imparted by the sub-block causal coding 
context option. Parallel encoding relies only on the fact that the arithmetic coder is restarted on coding pass 
boundaries (i.e. "-Crestart"), although a parallel implementation might be substantially simplified by the 
sub-block causal coding contexts since then the encoder and decoder implementations can be virtually 
identical. Since the sub-blocks do not affect each other in a non-causal manner, it is clear that we can run 
parallel arithmetic coding engines in each sub-block, provided they are appropriately synchronized. 
Specifically, when the engine associated with the n'th sub-block in the scan completes a coding pass, it 
transfers its state and boundary conditions to the engine associated with the (n+1)'th sub-block in the scan 
which then performs the same coding pass, while the engine associated with the n'th sub-block receives 
new state and boundary information from the engine for the (n-1)'th sub-block in the scan. This is only one 
of a number of parallel implementation options, but is perhaps the easiest to understand. It should be noted 
that a parallel decoder must decode the sub-block significance information for all bit-planes (or all those to 
be executed in parallel) in a sequential manner before the parallel decoding of the sub-bitplane passes can 
commence. This presents no problem since the sub-block significance coding process is remarkably simple 
and operates on vastly less data than the regular coding passes. 
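
The hand-over between engines described above amounts to a wavefront schedule. The sketch below is a simplified model rather than a hardware design: it assumes every coding pass takes one time step on every sub-block, and the function name is hypothetical. At time step t, the engine for sub-block n performs coding pass t - n.

```python
def wavefront_schedule(num_subblocks, num_passes):
    """Return, for each time step, the (sub_block, coding_pass) pairs that run.

    Pass k on sub-block n runs one step after pass k on sub-block n - 1,
    which is when the state and boundary conditions become available.
    """
    schedule = []
    for t in range(num_subblocks + num_passes - 1):
        step = [
            (n, t - n)
            for n in range(num_subblocks)
            if 0 <= t - n < num_passes
        ]
        schedule.append(step)
    return schedule
```

For three sub-blocks and two coding passes, the schedule takes four steps instead of the six needed by a single sequential engine, and at most two engines are busy at once.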



II-3.2 Opportunities for Memory Reduction 

In applications where external memory and memory bandwidth are at a premium and the application 
supplies or consumes the image in a line-by-line fashion, perhaps the most reasonable approach is to use a 
"K-line" transform implementation which produces K lines of each subband in any given resolution level in 
a swath, storing LL band samples back to memory until enough have been accumulated to run the 
same "K-line" transform over the LL band and so on in a recursive manner, exactly as 
explained in [WG1N1020R]. Larger values of K imply lower overall memory access bandwidth, since the 
cost of accessing the samples in the filter's vertical region of support is amortized over the larger number of 
lines being produced simultaneously. When working with the generic block coding engine, the most 
memory and bandwidth efficient solution is to set K equal to the block height (e.g. 32 or 64 rows - note 
that code-blocks need not be square), as discussed carefully in [WG1N1020R]. It is possible to reduce the 
value of K down to the height of a single sub-block (say 8 or 16 rows), without substantially increasing 
memory bandwidth, provided the sub-bitplane coding passes use independent probability models 
("-Crestart"), as also advocated in [WG1N1201]. This is by no means easy to understand, particularly since a 
variety of different implementations are enabled. The basic idea is that the coder would process a row of 
sub-blocks within the current row of code-blocks, within any given pass of the K-line transform, where K is 
set equal to the sub-block height (it is also possible to set K to any multiple of the sub-block height, but we will 
not consider that here). This is illustrated roughly in the figure below. All sub-bitplane coding passes must 
be implemented in parallel (at least conceptually) and at each code-block boundary (in the horizontal 
direction), the state of these coding engines must be flushed out to memory, to be retrieved later in the next 
K-line pass, where the next row of sub-blocks in the same set of horizontally adjacent code-blocks is 
visited. In this process, the significance of the various sub-blocks can be saved to external memory along 
with the other context and arithmetic coder state information, on horizontal block boundaries, so that the 
sub-block significance coding process can be executed later, after all sub-block high scans of the code- 
block have been completed. At this point, the embedded bit-stream for the code-block is also pieced 
together, from the disjoint pieces corresponding to each coding pass and each sub-block significance scan. 
In the decoder, the process is reversed. 




Figure 1 A row of code-blocks, coded in increments of K lines at a time, where K 
is the sub-block height. In the figure, there are only 4x4 sub-blocks per code- 
block. The state of the arithmetic coders for each sub-bitplane pass must all be 
flushed out and retrieved at the horizontal boundary of each code-block and the 
significance of each sub-block must also be saved externally, although the latter 
involves negligible memory and bandwidth. 

Of course, this process of maintaining, flushing and retrieving large amounts of state information on 
horizontal code-block boundaries is by no means simple and might prove prohibitive in many applications. 
Nevertheless, the possibility exists to reduce external memory requirements in this way. In practice, it is 
most likely that memory savings will be achieved, if desired, by code-blocks which are only a single sub- 
block high (e.g. 8 or 16 rows) and comparatively wide (e.g. 128 or 256 columns). In this case there is no 
need to flush the state of many parallel arithmetic coding engines to external memory and the entire code- 
block bit-stream can be assembled on-chip, which is conceptually and practically very much easier. In this 
case, one might still choose to use the parallel options (independent probability models for each coding 
pass and sub-block causal context formation) even though they have no effect on external memory 
requirements, in order to reduce on-chip memory requirements. Specifically, with independent probability 
models for each coding pass and sub-block causal context formation, it is possible to avoid buffering up the 
code-block samples beyond the width of a sub-block, which is generally much less than the width of a 
code-block, particularly if very wide blocks are used to compensate for reductions in block height down to 
that of a single sub-block. 

It is unclear whether the parallel options would ever be used in practice solely to reduce external or internal 
memory requirements. Rather, these options are of use primarily for achieving microscopic parallel 
encoding and/or decoding for "sample-per-clock" applications in which very high throughput is essential. 
When used in this way, the memory savings may be seen as an added bonus, particularly when working 
with wide code-blocks whose height is only that of a single sub-block, since in that case the substantial 
state information need not be repeatedly stored and retrieved from external memory, as mentioned above. 
It should also be noted that the memory savings will not be available if the image is to be transposed for 
one reason or another by decoding and inverse transforming (or forward transforming and encoding) code- 
blocks in a column by column, rather than row-by-row fashion. They are also not available in the so-called 
"block-based" applications which have been strongly advocated for digital camera applications. This may 
be important to some envisaged printing mechanisms. 

II-3.3 Performance Loss 

Because the size of the sub-block affects the degree of parallelism and the potential for external memory 
savings, we examine the performance penalty associated with the parallel options here for both 16x16 sub- 
blocks and 8x8 sub-blocks. In both cases we use the default code-block size of 64x64 samples. To contain 
the size of this report (in pages, but especially in Mbytes) we present results only for the case in which sub- 
block causal contexts are combined with the "-Crestart" option. This is the most interesting case since it 
allows both parallel encoding and parallel decoding. To support parallel decoding we would also need to 
add markers into the bit-stream to identify the start of each coding pass, but since the cost of this is 
identical to the cost of adding markers for the "option" coder in VM4 and any other variation on the same 
theme, there is no need to explicitly report the cost of adding these markers. In fact, provided the sub-block 
causal contexts are used and the arithmetic coding process is restarted at the beginning of each coding pass, 
it is always possible to discard and later regenerate the markers without interfering with the rest of the bit- 
stream. The results for the two different sub-block sizes appear in Table 8 and Table 9. 

Table 8 Comparison of the sequential coding performance with 
that associated with the parallel options. In both cases, the 
modifications described in Sections II-1.1, II-1.2, II-1.3, 
II-1.4 and II-2 are included. Here the default sub-block and 
code-block sizes of 16x16 and 64x64 respectively are used. 

Table 9 Comparison of the sequential coding performance with 
that associated with the parallel options. In both cases, the 
modifications described in Sections II-1.1, II-1.2, II-1.3, 
II-1.4 and II-2 are included. Here the sub-block and code- 
block sizes are 8x8 and 64x64 respectively. 

II-3.4 Comparison of Sub-Block Based vs. Line Based Block Coding 

At this point, it is worth comparing two somewhat different approaches to incorporating parallel 
encoding/decoding capabilities into the block-based entropy coder. The first approach, embodied in the 
"option" coder within VM4 and described in [WG1N1201], uses a line-based scan pattern within the code- 
block and line-causal context formation in order to enable parallel decoding. The second approach, 
described in this document, codes the samples sub-block by sub-block and uses sub-block causal context 
formation in order to enable parallel decoding. Table 10 and Table 11 compare the compression 
performance and number of symbols retrieved from the arithmetic coder per image pixel for the two 
approaches, using sub-block sizes of 16x16 and 8x8, respectively. Based on these results and the peculiar 
characteristics of the different approaches we may make the following comparative statements: 

1) The use of sub-blocks implies a significant reduction in the number of samples which must be visited 
and coded/decoded during the sub-bitplane coding passes, particularly at lower bit-rates. This in turn 
implies speed up and/or power reduction in software and hardware implementations. 

2) The use of sub-blocks reduces the adaptation overhead of the bitplane coding probability models 
somewhat, which may have a small positive impact on compression efficiency. Moreover, there is no 
need here to restrict the context formation to vertically causal contexts in the sub-block based 
approach. The only modification to contexts is at the lower and right hand sub-block boundaries, 
which affects only a small proportion of the samples, particularly for larger sub-block sizes. 
Consequently, loss of compression performance due to the restriction to causal contexts is reduced. 
The combination of these two effects is evidenced by the results which indicate that the sub-block 
based approach loses only about half as much (in dB) relative to the original EBCOT coder in VM4 as 
does the line based approach of the "option" coder. For 8x8 sub-blocks, performance is closer to that 
of the line based approach, but the reduction in symbol count is also larger — about 30% at bit-rates of 
interest. 

3) If the parallel options are used to reduce memory consumption, in one of the ways explained in the 
previous section, then there is no need to buffer K lines of samples across the full code-block width 
before anything can be coded as is the case with line-based block coding. Instead, it is sufficient to 
buffer samples for only the width of a single sub-block on-chip, which can represent a substantial 
saving, particularly when working in the most realistic memory saving mode in which the code-blocks 
are only a single sub-block high, but relatively wide (say 128 or 256 columns). For this most realistic 
case, the "option" coder in VM4 provides an alternative "column-by-column" scanning option for 
precisely this reason, whereas the sub-block based coder described in this document accomplishes a 
similar effect without introducing a different scanning pattern. 

4) A disadvantage of the sub-block based approach is that it restricts the opportunity to realize reduced 
external memory implementations to the cases in which the number of lines processed at a time is a 
multiple of the sub-block height. As already mentioned, this may not be a very highly utilized 
capability for a variety of reasons. Also, a sub-block height of 8 would probably be small enough to 
achieve most of the useful gain from such schemes without substantially increasing the memory 
bandwidth. 



Table 10 Comparison of the line-based "option" coder in VM4 
(with "-Ccausal") with the sub-block based coder with sub- 
block causal context formation and the corresponding 
parallel options. Here the sub-block size is 16x16 and the 
code-block size is 64x64 in both cases. 

Table 11 Comparison of the line-based "option" coder in VM4 
(with "-Ccausal") with the sub-block based coder with sub- 
block causal context formation and the corresponding 
parallel options. Here the sub-block size is 8x8 and the 
code-block size is 64x64 in both cases. 

II-4 A Lazy Coding Mode 

In Section II-3, we concentrated on achieving properties such as parallel encoding/decoding while 
minimizing the average number of symbols to be coded/decoded. In this section we investigate one final 
modification, which again we propose as an option, which both reduces the average number of 
arithmetically coded symbols and also reduces the maximum number of coding passes in which symbols 
might need to be arithmetically coded, which can be of advantage in simplifying hardware 
implementations. The idea is very simple and is connected with techniques used in Ricoh's "CREW"
algorithm. Specifically, we begin by observing that the probability models for many coding contexts attain
distributions close to uniform in the least significant bit-planes. It is a waste of effort to use the arithmetic
coding engine to code these binary symbols; instead, we prefer to send them as raw binary digits. Although
it is difficult to interleave raw binary digits into an arithmetically coded bit-stream, it is possible to bypass
the arithmetic coding engine altogether for an entire sub-bitplane coding pass, provided the arithmetic coder
is terminated at the end of the previous coding pass (using the Elias termination, which allows for unique
decoding with arbitrary bit-stream suffixes) and restarted at the beginning of the next pass which requires
arithmetic coding. This is a subset of the behaviour offered by the "-Crestart" option, and the software
implementation supports the behaviour both with and without the costly reinitialization of coding context
states which goes with the "-Crestart" mode. For our experiments, however, we use this "lazy" mode only
in conjunction with "-Crestart" and sub-block causal contexts, for reasons which will be explained shortly.
To be specific, in the proposed, optional modification, all of the binary symbols generated in the
"significance propagation" and "magnitude refinement" coding passes (see Section II-1.2) representing bits
in bit-planes $p < p_0 - K$ are written directly into the bit-stream as raw binary digits, entirely bypassing
the arithmetic coder, where $p_0$ denotes the most significant bit-plane in which any sample in the relevant
code-block becomes significant and $K$ is a parameter which we suggest should be set to $K = 3$. It should
be pointed out that an additional modification to the Elias termination procedure was required in order to
ensure that this "lazy" mode could be used in conjunction with the MQ coder. Specifically, since the MQ
coder is byte-oriented, with a bit-stuffing rather than carry propagation policy for dealing with carry
generation at the encoder, the arbitrary bit-stream suffixes which can be generated by the emission of raw
uncoded bits can produce an illegal bit-stream for a previous MQ-coded pass. To avoid this difficulty, we
modified the Elias termination implementation to allow for truly arbitrary suffixes; the details of this
modification are beyond the scope of this document, but they are adequately described by comments in the
source code.
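
The rule for deciding which coding passes bypass the arithmetic coder can be sketched in a few lines of C. This is only an illustration of the condition stated above, not the VM source code: the type and function names are hypothetical, bit-planes are assumed to be indexed with 0 as the least significant, and any pass other than the two named in the text is assumed to remain arithmetically coded.

```c
#include <stdbool.h>

/* Hypothetical pass labels; only the first two are named in the text. */
typedef enum {
    PASS_SIGNIFICANCE_PROPAGATION,
    PASS_MAGNITUDE_REFINEMENT,
    PASS_REMAINING   /* assumption: any other pass stays arithmetically coded */
} pass_type;

/* Returns true if the symbols of this coding pass would be written as raw
 * binary digits under the "lazy" rule.
 *   p  : bit-plane of the pass (assumed: 0 is the least significant)
 *   p0 : most significant bit-plane in which any sample of the code-block
 *        becomes significant
 *   K  : lazy-mode parameter (the text suggests K = 3)                    */
static bool lazy_pass_is_raw(pass_type type, int p, int p0, int K)
{
    if (type == PASS_REMAINING)
        return false;          /* never bypassed in this sketch */
    return p < p0 - K;         /* bypass only in the lower bit-planes */
}
```

For example, with p0 = 9 and K = 3, a magnitude refinement pass in bit-plane 5 would be sent raw, while the same pass in bit-plane 6 would still be arithmetically coded.
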
One advantage of the modification is that it substantially reduces the number of symbols which must be 
arithmetically coded at high bit-rates. Also, since we usually encode all code-blocks in an image at a high 
rate before truncating down to a final target bit-rate, this scheme substantially reduces the number of 
symbols which must typically be encoded and hence reduces the encoding time. We find, for example, that 
CPU times for reversible compression are typically reduced by 30%. On the other hand, the modification 
has relatively little effect on compression performance. 

The second advantage of the modification is that it substantially reduces the maximum number of coding
passes in which arithmetic coding might need to be used. Without the modification, the maximum number
of coding passes for any given code-block is $3P_{\max} - 2$, where $P_{\max}$ is the maximum number of
bit-planes in any given subband and might be on the order of 12 for the lower frequency subbands. On the
other hand, with the modification, the maximum number of coding passes which require arithmetic coding
for any given code-block is $P_{\max} + 2K = P_{\max} + 6$. In applications where we intend to use
microscopic parallelism to achieve "sample-per-clock" throughput, this means a substantial reduction in the
number of parallel arithmetic coding engines which must be included on the chip. Due to the importance of
the combination of this mode with parallel encoding/decoding, we provide results in Table 12 for the
combination of this so-called "lazy" option with the "-Crestart" option and sub-block causal coding
contexts, comparing the performance with the use of sub-block causal coding contexts and "-Crestart" alone.
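
The two pass counts quoted above can be checked by direct enumeration. The sketch below is an illustration under stated assumptions rather than part of the VM software: the function name is hypothetical, P stands for the maximum number of bit-planes, the most significant bit-plane is assumed to carry a single coding pass and every other bit-plane three (which gives the 3P - 2 total), and in "lazy" mode only one pass per bit-plane is assumed to remain arithmetically coded below the threshold.

```c
/* Counts the coding passes that need the arithmetic coder for a code-block
 * with P bit-planes (hypothetical helper, worst case where the top
 * bit-plane is the first in which a sample becomes significant).
 *   K    : lazy-mode parameter
 *   lazy : non-zero to apply the "lazy" bypass rule                       */
static int arithmetic_passes(int P, int K, int lazy)
{
    int count = 0;
    int p0 = P - 1;                      /* index of the top bit-plane */
    for (int p = p0; p >= 0; p--) {
        /* assumption: one pass in the top bit-plane, three elsewhere */
        int passes_here = (p == p0) ? 1 : 3;
        if (lazy && p < p0 - K)
            passes_here = 1;             /* the two bypassed passes are raw */
        count += passes_here;
    }
    return count;
}
```

With P = 12 and K = 3 this yields 34 passes without the modification and 18 with it, in agreement with the two expressions in the text. When P is no larger than K + 1 no pass is bypassed, so the second expression should be read as an upper bound.
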

Table 12 Comparison of the "lazy coding" option in combination with the parallel options with the
parallel options alone, for a sub-block size of 16x16 and a code-block size of 64x64.

To conclude this section, it is worthwhile comparing the lossless performance associated with the various
modifications and options which have been advanced. Table 13 provides this comparison in terms
of lossless bit-rate and total number of arithmetically coded symbols per sample, for five different
algorithms.



Table 13 Comparison of lossless coding performance for five different algorithms (mode variations)
using 64x64 code-blocks and 16x16 sub-blocks where applicable, with the 5/3 default reversible
Wavelet kernel. The first pair of columns refers to VM4; the second pair refers to the coder obtained by
applying the modifications described in Sections II-1 and II-2; the third pair of columns is obtained by
adding the parallel options ("-Crestart" and sub-block causal context formation); the fourth pair of
columns is obtained by adding the "lazy" coding option; and the last pair of columns is obtained using
the "option" coder from VM4 with "-Ccausal".

Some of the interesting points to observe are as follows:

♦ The "lazy" coding mode requires far fewer (usually less than half as many) symbols to be coded than 
any of the other modes. 

♦ The "lazy" coding mode generates a lower compressed bit-rate than that obtained with the same
parallel options but with the lazy mode turned off. Since the lazy mode affects only the significance
propagation and magnitude refinement coding passes, all of whose coding contexts are initialized to the
MQ coder's standard initial state at the beginning of the "learning curve", this result indicates that the
relevant distributions must be so close to uniform that emitting raw binary digits is more efficient than
letting the arithmetic coder learn this uniform distribution.

♦ The modifications to the original EBCOT algorithm described in Sections II-1 and II-2 have negligible
effect on lossless performance.

♦ The "option" coder in VM4 codes slightly more symbols and compresses slightly less efficiently than 
all the other modes. 



III CPU Times for Various Modes of Interest

In this section we provide CPU decoding times for each bit-rate and each of the JPEG2000 images
we have been considering, for a number of algorithm modes. The CPU times are all obtained using a 400
MHz Pentium Pro (perhaps the most appropriate target platform for initial consumer applications of
JPEG2000). To reduce the substantial jitter in CPU timing results, we added a loop into the
implementation of "decode_block" which iterates five (5) times through the block decoding process
between calls to the standard ANSI "C" library function, "clock()", which is used to measure the CPU time.
The same modification was made to the VM4.1 source code for both the "option" coder and the original
EBCOT variation, so as to ensure that the results can be reliably compared. In this section we work with
the latest version of the VM4.1 source code for comparisons with existing entropy coder options.
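
The timing approach described above, repeating the work several times between two calls to "clock()", can be sketched as follows. This is a minimal illustration and not the code added to the VM: the function and type names are hypothetical, the dummy routine merely stands in for the real block decoder, and dividing by the repeat count to report a per-iteration time is an assumption that the text does not state.

```c
#include <time.h>

/* Hypothetical signature for the routine being timed. */
typedef void (*block_fn)(void *state);

/* Stand-in for the real block decoder: it only counts its invocations. */
static void dummy_decode_block(void *state)
{
    int *calls = (int *)state;
    (*calls)++;
}

/* Runs fn `repeats` times between two calls to clock() and returns the
 * average CPU time per iteration in seconds. Repeating the work makes the
 * measured interval long relative to the resolution of clock().         */
static double time_block_seconds(block_fn fn, void *state, int repeats)
{
    clock_t start = clock();
    for (int i = 0; i < repeats; i++)
        fn(state);
    clock_t stop = clock();
    return (double)(stop - start) / CLOCKS_PER_SEC / repeats;
}
```

A call such as time_block_seconds(dummy_decode_block, &calls, 5) mirrors the five iterations mentioned in the text.
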



The most interesting results are distilled in the four tables which follow. The captions for these tables are 
self explanatory and so we simply summarize here some of the conclusions which may be drawn from 
these results: 

♦ The proposed modifications to the EBCOT coder in VM4.1 yield improvements in software execution 
speed on the order of 15% with negligible loss in compression performance. 

♦ If 8x8 sub-blocks are used instead of 16x16, performance degrades very little while software execution 
speed increases by about 6 or 7% (more at lower bit-rates). 

♦ With 8x8 sub-blocks the modified EBCOT coder runs about 15% faster than the "option" coder in
VM4.1, with slight improvements in compression performance, when operated in comparable modes.

♦ The "lazy" coding option yields another 10% improvement in software execution speed at moderate 
bit-rates, while introducing a slight loss in compression efficiency. At least in the case where the 
parallel options are enabled, the "lazy" coding option to the modified EBCOT coder yields almost 
identical compression performance to the "option" coder in VM4.1 with a speed-up of about 25%. The 
speed-up factor increases rapidly at very high bit-rates without any further loss in compression 
efficiency. 



Table 14 Comparison of CPU times (and PSNR) between VM4.1 and the EBCOT coder modified as
described in Sections II-1 and II-2, with 64x64 code-blocks and 16x16 sub-blocks in both cases.

Table 15 Comparison of CPU times (and PSNR) for the modified EBCOT coder described in Sections
II-1 and II-2, using 16x16 and 8x8 sub-blocks, respectively, with 64x64 code-blocks in both cases.

Table 16 Comparison of CPU times (and PSNR) between the VM4.1 "option" coder with "-Ccausal"
and the modified EBCOT coder with comparable parallel options, "-Crestart" and sub-block causal
contexts. The sub-block size in this case was set to 8x8. Code-blocks are 64x64 in both cases.

Table 17 Comparison of CPU times (and PSNR) between the modified EBCOT coder with parallel
options, "-Crestart" and sub-block causal contexts, and the "lazy" coding mode with the same parallel
options. Sub-block and code-block dimensions are 8x8 and 64x64, respectively.

IV Results Reported in Spreadsheet

The spreadsheets supplied by Alan Chien will be filled out for the following modes to provide a larger set 
of test results under a subset of the conditions for which partial results are reported in this document: 



• VM4.1 EBCOT coder 

• VM4.1 "option" coder with "-Ccausal" 

• Combined modifications described in Sections II-1 and II-2.

• Parallel option ("-Crestart" and sub-block causal contexts, 8x8 sub-blocks) 

V CONCLUSIONS AND RECOMMENDATIONS

From the material presented in this document, the following conclusions may be drawn. 

• The modifications to the original EBCOT algorithm described in Section II-1 have negligible
effect on compression performance, while simplifying the implementation of the algorithm.
We recommend that they be adopted forthwith.

• The alternative sub-block significance coder described in Section II-2 leads to very minor
improvements in compression performance. We recommend that it be adopted on the basis
that it avoids some of the controversial issues currently surrounding the use of quad-tree
techniques (hopefully, this situation will change) and also because, unlike quad-tree coding,
its coding efficiency does not rely upon the fact that the number of sub-blocks in the width
and height of the code-block are both approximately equal to $2^n$ for some integer $n$. Thus,
for elongated blocks we expect the quad-tree technique to suffer, although we have not had
time to investigate this.

• The options described in Section II-3 introduce the possibility of exploiting "microscopic 
parallelism" in hardware implementations, and perhaps also reducing the external memory 
consumption in some applications. Because these options hurt performance somewhat and 
are not of interest to all applications, we recommend that they be adopted as options, rather 
than as mandatory. These options accomplish the same objectives as the existing "option" 
coder in VM4 while substantially reducing the number of symbols which must be coded, 
increasing execution speed of software implementations, and yielding somewhat higher 
compression performance. In fact, the EBCOT coder in VM4, the modifications and options 
proposed here and the "option" coder in VM4 are all very close relatives of one another. 
Based on the experimental evidence provided here, there does not appear to be any need to 
preserve multiple distinct entropy coders in the VM; the "option" coder should be removed 
and the modifications suggested in this document, many of which utilize similar or identical 
principles to the "option" coder, should be included instead. 

• The "lazy" coding mode described earlier in this document introduces the possibility of very
substantial reductions in the number of symbols which must be coded at higher bit-rates and,
especially, in lossless compression applications. Performance is degraded relatively little at
moderate bit-rates and lossless compression is actually improved. For the moment, we recommend the
adoption of this variation as an option.

• Based on the results presented in this report, there is reason to suggest that we might standardize on a 
sub-block size of 8x8, rather than the current default sub-block size of 16x16. Performance is reduced 
very little and software execution time and symbol count are somewhat reduced by the use of 8x8 sub- 
blocks. Moreover, 8x8 sub-blocks almost certainly lend themselves to more compact, less expensive 
hardware implementations. The hardware implementation suggestions in WG1N1020R for exploiting 
the compactness of the sub-block scan would obviously benefit from the use of smaller sub-blocks. 
Finally, smaller sub-blocks enhance the opportunities for parallelism. 



